A Prototype Personal Dictation System

Author

  • Adam Janin
Abstract

We describe a prototype personal dictation system. As a user speaks, the system produces a real-time audio transcript. The user can correct and annotate the transcript using a graphical user interface (UI) running on a handheld computer. The speech recognition runs on a workstation. Although the two are currently connected via a wired network, a wireless connection is planned. The primary focus of this paper is the UI for correcting the transcript.

INTRODUCTION

We are in the process of developing Meeting Recorder, a portable device that records meetings in uninstrumented, natural environments. The Meeting Recorder will support multiple speakers, allow correction and annotation of the transcript, support indexing and searching of the audio record, and will be self-contained using Vector IRAM. The full Meeting Recorder project is very ambitious, involving research in automatic speech recognition (ASR), speaker and topic tracking, information retrieval, collaboration, annotations, and small form-factor UIs. It also requires a chip that will not be available for another year or so.

To get a handle on some of the infrastructure and UI issues of the Meeting Recorder project, we developed an intermediate testbed, a Personal Dictation System. It allows a single user to dictate text in real time. The text appears on the screen of a Palm Pilot (a handheld computer). The user can then correct the transcript using a pen interface. Limited annotation is also allowed.

SYSTEM ARCHITECTURE

Since the Pilot is not yet capable of providing speech recognition, the ASR system runs on a workstation connected to the Pilot via a network. Also, we used a headset microphone to limit the problems associated with background noise and reverberation. Although both the headset microphone and the wired network require the user to be tethered to a workstation, we will soon lift this restriction by providing a wireless network and microphone.
The ICSI hybrid ASR system was used to perform the speech recognition. Although the accuracy and throughput of the ICSI system are very good, it was not designed to be interactive, which caused some problems with user interaction. In addition, the ICSI system was trained on television and radio news broadcasts, so it does much better with a news-like vocabulary and speaking style.

The UI runs on the Pilot. It allows the user to correct the transcripts and create new text. The components running on the Pilot communicate with the components running on the workstation using TCP/IP. In fact, three workstations were used. One captured the audio signal and performed signal processing, the second ran the ASR algorithms and the correction server (see below), and the third connected the Pilot to the network through the Pilot's cradle.

CORRECTING AND ANNOTATING

We distinguish annotation from correction. Correcting is the process of informing the recognizer that it has made an error. For example, if you say "a record day" and the transcript reads "the records pay", you may want to inform the system that it is "day" rather than "pay". Annotation, on the other hand, allows the user to change or add to the transcript. Even if the recognition is perfect, the user may want to modify the results or add additional marks. For the purposes of the Personal Dictation System, non-textual annotations (e.g. circling, underlining) were not supported.

Correction is useful for two reasons. First, if the recognizer has a good idea of the possible alternatives, it may be much faster to select one of these than to delete the incorrect text and then enter the correct text. Second, the recognizer can adapt to the user more efficiently if the correct transcript is provided. Additionally, the correction mechanism allows out-of-vocabulary words to be added.
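The distinction between correction and annotation can be illustrated with a minimal sketch. The `Transcript` class, its method names, and the `feedback` log are hypothetical and are not taken from the Personal Dictation System; the log merely stands in for whatever information the recognizer would receive. The point is only that a correction records what was wrong and what was right, while an annotation changes the text without informing the recognizer.

```python
from dataclasses import dataclass, field


@dataclass
class Transcript:
    """Hypothetical transcript model (not the paper's actual data structure)."""
    words: list
    # Each entry is (position, recognized_word, intended_word).
    feedback: list = field(default_factory=list)

    def correct(self, index, intended_word):
        """Correction: fix a recognition error and log it for the recognizer."""
        self.feedback.append((index, self.words[index], intended_word))
        self.words[index] = intended_word

    def annotate(self, index, text):
        """Annotation: insert user text; nothing is reported to the recognizer."""
        self.words.insert(index, text)


transcript = Transcript(["the", "records", "pay"])
transcript.correct(2, "day")             # the recognizer heard "pay"; the user meant "day"
transcript.annotate(3, "(check date)")   # a free-form note added by the user
```

After these two calls the transcript text has changed twice, but only the correction appears in `feedback`, which is the part a recognizer could use for adaptation.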
The system must allow the user to specify a portion of the transcript to be corrected, generate alternatives to the selection, and update the transcript. The following paragraphs detail some of the proposed solutions and the results of (very) informal user studies.

The first attempted method of specifying an incorrect portion of the transcript involved selecting the incorrect text using a standard press-and-drag method, followed by tapping on a "Correct It" button at the bottom of the screen. This method was the easiest to implement, as the Pilot directly supports it. However, the interaction was slow, especially for inaccurate sections of the transcript. Also, the "Correct It" button takes up valuable screen real estate.

Next, we tried a method inspired by the Pilot's popup triggers. When the user presses and holds on a word, a popup list appears with alternatives for the word. A drawback of this paradigm is that it requires separate "Annotate" and "Correct" modes. Another drawback is that only a single word can be specified, rather than an entire phrase. Regardless, this interaction was preferred over the previous method.

The press-and-hold method had another problem as well. Frequently, a user would tap on a word rather than press-and-hold. Therefore, the final method chosen for specifying a recognition error was to tap on a word.

The system for generating alternatives and updating the transcript requires some understanding of how transcripts are produced. As the ASR system looks for possible matches to what the user said, it generates a lattice of words. Figure 1 shows an example excerpt from a lattice in which the user said "a record day". Each path from the root node to a leaf represents a hypothesized utterance. Each hypothesis receives a score. When ranked in order, the hypotheses produce an "N-best" list of possible utterances. The transcript is generated by selecting the hypothesis with the highest score (the first element of the N-best list). In this case, the transcript might read "the records pay".
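The path from lattice to N-best list to transcript can be sketched as follows. This is an illustration under stated assumptions, not the ICSI system's implementation: the lattice is modeled as a small tree, the per-word scores are invented log-style values chosen so that "the records pay" ranks first, and the `alternatives` helper is one plausible way to build a popup list for a tapped word that the excerpt does not confirm.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    word: str          # empty string for the root
    score: float       # made-up score contribution of this word
    children: list = field(default_factory=list)


def hypotheses(node, prefix=(), total=0.0):
    """Yield (score, words) for every root-to-leaf path in the lattice."""
    words = prefix + ((node.word,) if node.word else ())
    total += node.score
    if not node.children:
        yield total, words
    for child in node.children:
        yield from hypotheses(child, words, total)


def n_best(root, n):
    """Rank all hypotheses by score, best first, and keep the top n."""
    return sorted(hypotheses(root), key=lambda h: h[0], reverse=True)[:n]


def alternatives(ranked, position):
    """Assumed helper: distinct words seen at `position`, best first,
    excluding the word already shown in the transcript."""
    seen = [ranked[0][1][position]]
    for _, words in ranked:
        if position < len(words) and words[position] not in seen:
            seen.append(words[position])
    return seen[1:]


# Hypothetical lattice for the utterance "a record day".
root = Node("", 0.0, [
    Node("the", -1.0, [
        Node("records", -1.0, [Node("pay", -1.0), Node("day", -1.5)]),
        Node("record", -2.0, [Node("day", -1.0)]),
    ]),
    Node("a", -1.5, [Node("record", -1.2, [Node("day", -1.0)])]),
])

ranked = n_best(root, 4)
transcript = " ".join(ranked[0][1])   # the highest-scoring hypothesis
```

With these invented scores the transcript is "the records pay", and tapping the third word would offer "day" via `alternatives(ranked, 2)`.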




Publication date: 1999